Randomization Based Privacy Preserving Categorical Data Analysis
Authors
Abstract
Ling Guo. Randomization Based Privacy Preserving Categorical Data Analysis. Under the direction of Dr. Xintao Wu.
The success of data mining relies on the availability of high-quality data. To ensure quality data mining, effective information sharing between organizations becomes a vital requirement in today’s society. Since data mining often involves sensitive information of individuals, the public has expressed a deep concern about their privacy. Privacy-preserving data mining is the study of eliminating privacy threats while, at the same time, preserving useful information in the released data for data mining. This dissertation investigates data utility and privacy of randomization-based models in privacy preserving data mining for categorical data.
For the analysis of data utility in the randomization model, we first investigate the accuracy analysis for association rule mining in market basket data. Then we propose a general framework to conduct theoretical analysis on how the randomization process affects the accuracy of various measures adopted in categorical data analysis. We also examine data utility when randomization mechanisms are not provided to data miners to achieve better privacy. We investigate how various objective association measures between two variables may be affected by randomization. We then extend it to multiple variables by examining the feasibility of hierarchical loglinear modeling. Our results provide a reference to data miners about what they can and cannot do with certainty upon randomized data directly, without knowledge of the original distribution of the data or the distortion information.
Data privacy and data utility are commonly considered as a pair of conflicting requirements in privacy preserving data mining applications. In this dissertation, we investigate privacy issues in randomization models. In particular, we focus on attribute disclosure under linking attacks in data publishing.
We propose efficient solutions to determine optimal distortion parameters such that we can maximize utility preservation while still satisfying privacy requirements. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from
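As a rough illustration of the kind of randomization the abstract discusses, the sketch below distorts a categorical attribute and then estimates the original category proportions from the distorted data. It assumes one simple scheme that is not taken from the dissertation: each value is kept with probability `keep_prob` and otherwise replaced by a category drawn uniformly at random. The function names `randomize` and `estimate_distribution`, the parameter `keep_prob`, and the sample data are all hypothetical.

```python
import random
from collections import Counter


def randomize(values, categories, keep_prob, rng):
    """Keep each value with probability keep_prob; otherwise replace it
    with a category drawn uniformly at random (assumed scheme)."""
    return [
        v if rng.random() < keep_prob else rng.choice(categories)
        for v in values
    ]


def estimate_distribution(randomized, categories, keep_prob):
    """Estimate the original proportions from randomized data.

    Under the assumed scheme, the observed proportion of category c is
        lambda_c = keep_prob * pi_c + (1 - keep_prob) / k,
    so solving for pi_c gives the estimator below. Estimates can fall
    slightly outside [0, 1] on small samples.
    """
    n = len(randomized)
    k = len(categories)
    counts = Counter(randomized)
    return {
        c: (counts[c] / n - (1 - keep_prob) / k) / keep_prob
        for c in categories
    }


# Hypothetical data: 60% "A", 30% "B", 10% "C".
rng = random.Random(0)
categories = ["A", "B", "C"]
original = ["A"] * 6000 + ["B"] * 3000 + ["C"] * 1000
released = randomize(original, categories, keep_prob=0.7, rng=rng)
estimate = estimate_distribution(released, categories, keep_prob=0.7)
```

A smaller `keep_prob` hides individual values more strongly but increases the variance of the estimates, which is the utility and privacy trade-off the abstract refers to. Note that the estimator needs `keep_prob`, so it does not apply to the setting where distortion parameters are withheld from the data miner.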
Similar Resources
Privacy Preserving Categorical Data Analysis with Unknown Distortion Parameters
Randomized Response techniques have been investigated in privacy preserving categorical data analysis. However, the released distortion parameters can be exploited by attackers to breach privacy. In this paper, we investigate whether data mining or statistical analysis tasks can still be conducted on randomized data when distortion parameters are not disclosed to data miners. We first examine h...
An Effective Data Transformation Approach for Privacy Preserving Clustering
A new stream of research, privacy preserving data mining, emerged due to the recent advances in data mining, Internet and security technologies. Data sharing among organizations is considered useful, as it offers mutual benefit for business growth. Preserving the privacy of shared data for clustering was considered the most challenging problem. To overcome the problem, the data owner publishe...
VICUS - A Noise Addition Technique for Categorical Data
Privacy preserving data mining and statistical disclosure control have received a great deal of attention during the last few decades. Existing techniques are generally classified as restriction and data modification. Within data modification techniques noise addition has been one of the most widely studied but has traditionally been applied to numerical values, where the measure of similarity ...
Analysis of Dynamic Longitudinal Categorical Data in Incomplete Contingency Tables Using Capture-Recapture Sampling: A Case Study of Semi-Concentrated Doctoral Exam
Abstract. In this paper, dynamic longitudinal categorical data and estimation of their parameters in incomplete contingency tables are evaluated. To apply the proposed method, a study has been conducted on the data of the semi-concentrated doctoral exam of the National Organization for Educational Testing (NOET). The results of studies such as the obtained confidence intervals and calculating t...
Preserving Micro Data Release: Categorical and Numerical Data
Data mining techniques, in spite of their benefits in a wide range of applications, have also raised threats to privacy and data security. All the attributes in a database table can be classified into three categories: identifying attributes, sensitive attributes and quasi-identifier attributes. K-anonymity is a popular approach for privacy preserving data mining and the problems with Kanonymi...